feat(scraper): add self-healing scraper heal command #11

Merged
meirk-brd merged 25 commits into main from feat/scraper-self-healing on May 28, 2026

@meirk-brd
Collaborator

Summary

Adds bdata scraper heal <collector_id> "<prompt>" — AI self-healing for scrapers. When a scraper drifts (selectors move, a page redesigns), the agent fixes it in place instead of rebuilding, so the saved collector_id keeps working and improves.

  • One new command, scraper heal — the maintenance twin of scraper create: POST /dca/collectors/{id}/refactor_template → poll refactor_template/progress, reusing the existing async trigger→poll machinery (poll_until, build_ai_trigger_retry, SCRAPER_BODY_HINTS, the 429 backoff).
  • The agent is the detector. The CLI never guesses a scraper is "broken" (a heal is slow/billable/mutating). The agent inspects run output and decides. So run stays read-only — there is no --heal flag on run.
  • Closes the loop. On success heal emits a {collector_id, status, completed_steps, prompt, view_url, next_step} envelope (same shape as create), where next_step is a ready-to-run bdata scraper run <id> <url> verify command (--url bakes the real URL in). The intended agent flow is: run → inspect → heal → re-run → verify.
  • Failure is non-destructive. A failed heal leaves the existing scraper unchanged and still working; the recovery note says so (distinct from create's "half-built collector" wording).
  • Required <prompt> (≤1000 chars, validated fast); carries over --timeout, --max-retries/--no-retry, -o/--json/--pretty/--legacy-output/--timing/-k.
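As a rough illustration of the fast prompt validation and the success envelope described above, here is a minimal sketch. The field names and the 1000-character limit come from this description; the function names (validate_prompt, build_next_step, build_heal_envelope), the field types, and the literal `<url>` fallback when no --url is given are assumptions for illustration, not the code in this PR.

```typescript
// Hypothetical sketch: field names are from the PR description, types are assumed.
interface HealEnvelope {
    collector_id: string;
    status: string;
    completed_steps: string[]; // assumed to be a list of step names
    prompt: string;
    view_url: string;
    next_step: string;
}

const MAX_PROMPT_CHARS = 1000; // limit stated in the PR description

// Returns an error message, or null when the prompt is acceptable.
// Runs before any network call, so bad input fails fast.
function validate_prompt(prompt: string): string | null {
    if (prompt.trim().length === 0) return 'prompt is required';
    if (prompt.length > MAX_PROMPT_CHARS) {
        return `prompt must be at most ${MAX_PROMPT_CHARS} characters`;
    }
    return null;
}

// Builds the verify command; falls back to a literal <url> placeholder
// (an assumption) when the caller did not pass --url.
function build_next_step(collector_id: string, url?: string): string {
    return `bdata scraper run ${collector_id} ${url ?? '<url>'}`;
}

// Assembles the envelope from values the caller already has.
function build_heal_envelope(args: {
    collector_id: string;
    status: string;
    completed_steps: string[];
    prompt: string;
    view_url: string;
    url?: string;
}): HealEnvelope {
    return {
        collector_id: args.collector_id,
        status: args.status,
        completed_steps: args.completed_steps,
        prompt: args.prompt,
        view_url: args.view_url,
        next_step: build_next_step(args.collector_id, args.url),
    };
}
```

Keeping next_step a plain command string means an agent can execute it verbatim to close the run → inspect → heal → re-run → verify loop.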

Built test-first; 25 new tests. Design + plan in docs/superpowers/.

Test Plan

  • pnpm type-check clean
  • pnpm build clean
  • src/__tests__/commands/scraper.test.ts — 131/131 pass (incl. validation, all failure paths, retry forwarding, legacy-output, progress-endpoint URL, exit codes, subcommand wiring)
  • bdata scraper heal --help shows args/options/examples; bdata scraper --help lists heal
  • bdata scraper run --help has no --heal option (design invariant)
  • Live run against a real collector (reviewer, with credentials)

Note: the repo's full suite has 8 pre-existing failures in browser/daemon/discover/scrape tests that also fail on main (0.3.0) — unrelated to this change. Agent-facing docs are in a companion PR on the brightdata-plugin repo (scraper-studio skill: Action 3, api-flow, recipe, common-mistake).

@meirk-brd
Collaborator Author

Update: self-healing approval gate + scraper approve

Live e2e (the reason we held merge) uncovered that the self-healing AI flow is human-in-the-loop: it pauses at status: "pending_answer" / step: "user_approval" and never auto-completes. The previous heal kept polling that state until it timed out, then reported a misleading error. This update fixes that and adds the agent-driven approval path that the engineer's resume_automation_job endpoint enables.

What changed (11 + 1 commits on this branch since the original heal):

  • extract_progress_status now recognizes pending_answer → an awaiting_approval gate sentinel, so the shared poll stops at the gate instead of timing out.
  • scraper heal (default) stops at the gate: emits status: "awaiting_approval" with preview_result (sample rows the fix would produce) + a compact diff_summary, and a next_step pointing at the approve command. Exit 0 — the heal succeeded, it just needs a decision.
  • New scraper approve <collector_id> command (--reject to discard): POST /dca/collectors/{id}/resume_automation_job {message}, polls to done, hands back a scraper run verify next_step. Re-runnable if a heal needs multiple approvals.
  • scraper heal --auto-approve for the autonomous path (approve + poll to done in one command).
  • Shared resume_and_poll + emit_heal_terminal seams keep heal and approve consistent (incl. resume_failed vs poll_failed labeling).
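The gate handling above can be sketched as a small status classifier plus a stop condition for the poll loop. Only the pending_answer → awaiting_approval mapping and the exit-0-at-the-gate behavior come from this comment; the function names, the raw "done" and "failed" status strings, and the non-zero failure exit code are assumptions, and the real extract_progress_status may look quite different.

```typescript
// Hypothetical poll states; only 'awaiting_approval' is named in the PR.
type PollState = 'in_progress' | 'done' | 'failed' | 'awaiting_approval';

interface ProgressResponse {
    status?: string;
    step?: string;
}

// Maps a raw progress payload to a poll state.
// 'pending_answer' is the gate status reported by the live API;
// the 'done' and 'failed' literals are assumed for illustration.
function classify_progress(res: ProgressResponse): PollState {
    if (res.status === 'pending_answer') return 'awaiting_approval';
    if (res.status === 'done') return 'done';
    if (res.status === 'failed') return 'failed';
    return 'in_progress';
}

// The shared poll stops on any terminal state, and the gate counts as one,
// so a paused heal no longer runs into the timeout.
function should_stop_polling(state: PollState): boolean {
    return state !== 'in_progress';
}

// Stopping at the gate is a success (exit 0); the failure code 1 is assumed.
function exit_code_for(state: PollState): number {
    return state === 'failed' ? 1 : 0;
}
```

Treating the gate as a terminal state rather than an error is what lets heal and approve share one poll loop while reporting different outcomes.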

Verification:

  • pnpm type-check, pnpm build clean.
  • src/__tests__/commands/scraper.test.ts: 152/152 (+21 tests).
  • Live e2e against the real API: healed a real collector → it stopped at the gate with the preview → scraper approve resumed it via the live resume_automation_job endpoint → polled to done → verify run returned data. The approve→done pipeline is confirmed end-to-end.
  • Invariants hold: scraper run stays read-only (no --heal/--approve); approve has no retry flags.
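For illustration, an agent consuming the gate envelope might choose its follow-up command along these lines. The scraper approve command, the --reject flag, and the status and preview_result fields come from this comment; the decision rule (approve only when the preview contains rows), the function name, and the field types are assumptions, and a real agent would likely inspect the preview rows and diff_summary more carefully.

```typescript
// Hypothetical subset of the gate envelope; types are assumed.
interface GateEnvelope {
    collector_id: string;
    status: string;
    preview_result?: unknown[];
}

// Returns the command to run next, or null when no decision is pending.
// Assumed policy: approve when the preview has at least one row, otherwise reject.
function decide_gate_command(env: GateEnvelope): string | null {
    if (env.status !== 'awaiting_approval') return null;
    const has_rows =
        Array.isArray(env.preview_result) && env.preview_result.length > 0;
    const base = `bdata scraper approve ${env.collector_id}`;
    return has_rows ? base : `${base} --reject`;
}
```

Because the decision lives in the agent, scraper run can stay read-only and the CLI never has to guess whether a fix is acceptable.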

The 8 failing tests in the full suite are pre-existing (browser/daemon/discover/scrape) and present on main — unrelated to this change. Agent-facing docs updated in the companion skills PR (Action 4 + resume endpoint + recipe).

@meirk-brd meirk-brd merged commit 5e96df2 into main May 28, 2026